Estimation of Processing Time of an API

Learn to estimate the processing time of an API.

In the previous lesson, we learned that the response time is a combination of latency and processing time, as given in the following equation:

Time_{response} = Time_{latency} + Time_{processing}

Let's start by estimating the processing time of an API.

Processing time#

The processing time of a server is defined as the time a server takes to process a request to prepare a response. This is one of the important factors that affect response time. Therefore, estimating processing time is an important part of estimating the total response time of a service.

The illustration below is a high-level architecture of what constitutes processing time in an API. The server interacts with the database to execute queries for data retrieval that might also involve file handling. It includes the round trip from the API gateway to downstream services, the request execution time, and the response preparation time.

Processing time of a request (API gateway to downstream services)

There is no rule of thumb to calculate the exact processing time. It depends on several things, like the services, the components within the services, and the technologies (both hardware and software). Usually, the processing involves analyzing a query and fetching the data from the server’s memory or corresponding database. The processing time will primarily depend on three factors that are listed below:

  • The type of request

  • The application server’s time to handle a request

  • Database query execution time

The processing time also depends on the specification of the machine that processes the user’s request. There are plenty of servers available with different specifications supporting different requirements. We’ll consider a typical server from Amazon Web Services (AWS) whose specifications are defined below:

Server Specifications

| Component | Specification |
| --- | --- |
| Sockets | 2 |
| Processor | Intel Xeon X2686 |
| RAM | 240 GB |
| Cores | 36 cores (72 hardware threads) |
| Cache (L3) | 45 MB |
| Storage | 15 TB |

Request processing estimation#

In this section, we’ll estimate the time a server takes to handle a request, depending on the type of request. There are mainly two types of requests: those bound by the CPU and those bound by memory.

  • CPU bound: These are requests where the CPU acts as a limiting factor.

  • Memory bound: These are requests where the memory acts as a limiting factor.

Let's say that each CPU-bound request takes 200 milliseconds (ms), and each memory-bound request takes 50 ms to complete. The requests per second (RPS) for each are calculated using the following formulas.

RPS_{CPU} = Num_{CPU} \times \frac{1}{Task_{time}}

The following terms are used in this calculation:

  • RPS_{CPU}: The CPU-bound requests per second

  • Num_{CPU}: The number of hardware threads (CPU threads)

  • Task_{time}: The time each task takes to complete

RPS_{CPU} = 72 \times \frac{1}{200\ ms} = 360\ RPS

Similarly, for memory-bound requests, if each worker consumes 300 MB of memory, the RPS is calculated as:

RPS_{memory} = \frac{RAM_{size}}{Worker_{memory}} \times \frac{1}{Task_{time}}

The following terms are used in this calculation:

  • RPS_{memory}: The memory-bound requests per second

  • RAM_{size}: The total size of the RAM

  • Worker_{memory}: The memory consumed by each worker that handles a request

RPS_{memory} = \frac{240\ GB}{300\ MB} \times \frac{1}{50\ ms} = 16{,}000\ RPS

If we consider half the requests are memory bound and half are CPU bound, then the average RPS would be:

RPS = \frac{360}{2} + \frac{16{,}000}{2} = 8{,}180 \approx 8{,}000\ RPS

Considering the calculations above, the system takes approximately 0.125 ms (1/8,000 of a second) to handle each request.
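These back-of-envelope numbers are easy to sanity-check in code. The sketch below is a minimal Python script (the helper names are our own) that reproduces the CPU-bound, memory-bound, and averaged RPS figures:

```python
# Back-of-envelope RPS estimates for the server described above.
# The inputs (72 hardware threads, 240 GB RAM, 300 MB per worker,
# 200 ms / 50 ms task times) come from the lesson's example.

def rps_cpu_bound(num_threads: int, task_time_s: float) -> float:
    """Requests per second when the CPU is the limiting factor."""
    return num_threads * (1 / task_time_s)

def rps_memory_bound(ram_bytes: int, worker_bytes: int, task_time_s: float) -> float:
    """Requests per second when memory is the limiting factor."""
    workers = ram_bytes / worker_bytes        # concurrent workers that fit in RAM
    return workers * (1 / task_time_s)

GB, MB = 10**9, 10**6

cpu_rps = rps_cpu_bound(72, 0.200)                      # 360.0 RPS
mem_rps = rps_memory_bound(240 * GB, 300 * MB, 0.050)   # 16,000.0 RPS

# Half the traffic is CPU bound and half is memory bound:
avg_rps = cpu_rps / 2 + mem_rps / 2                     # 8,180 RPS, rounded to ~8,000
per_request_ms = 1000 / 8000                            # 0.125 ms per request
```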

Quiz

Question

Let’s consider a system having 72 cores (144 hardware threads) with 128 GB RAM. Each CPU-bound request takes 100 ms, and each memory-bound request takes 70 ms. How many requests per second (RPS) can the system handle for each request type if each worker consumes 200 MB of memory?

Answer

Requests per second handled by CPU are calculated as:

RPS_{CPU} = 144 \times \frac{1}{100\ ms} = 1{,}440\ RPS

Requests per second handled by memory are calculated as:

RPS_{memory} = \frac{128\ GB}{200\ MB} \times \frac{1}{70\ ms} \approx 9{,}142\ RPS
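The quiz’s arithmetic can be verified the same way; this is plain Python with variable names of our own choosing:

```python
# 144 hardware threads, each CPU-bound request takes 100 ms:
cpu_rps = 144 * (1 / 0.100)                  # 1,440 RPS

# 128 GB RAM, 200 MB per worker, each memory-bound request takes 70 ms:
workers = (128 * 10**9) / (200 * 10**6)      # 640 workers fit in RAM
mem_rps = workers * (1 / 0.070)              # ~9,142 RPS
```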

Query execution time#

The latency incurred due to database queries is significant because data retrieval is a time-consuming task. Therefore, the database query execution time should be as fast as possible. Caching of all types, including filesystem, database, system/machine-level, and distributed caches, helps greatly reduce the query execution time.

Let's see an example of how we can measure the time it takes to execute a query. We’ll use MySQL as the database type because it is widely used in the industry. Query executions are measured in the following way:

A depiction of how query execution latency can be measured on MySQL server

Note: The technique above is referred to as profiling. In general, an INSERT query takes between 0.16 and 3 ms, whereas a SELECT query takes approximately 0.13 to 2 ms to execute on a MySQL server.

These query execution times include memory access time as well. Depending on the query, the server either writes data to memory while saving it to the database or reads it from memory. These times assume an optimized database, with an optimized structure and relationships, running on the AWS server defined above.
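MySQL’s server-side profiler (`SET profiling = 1;` followed by `SHOW PROFILES;`) reports per-query times on the server. As a self-contained illustration of the same idea measured from the client, the sketch below times an INSERT and a SELECT. It uses Python’s built-in sqlite3 so that it runs anywhere; the timing pattern applies unchanged to a MySQL cursor:

```python
import sqlite3
import time

# Client-side query timing: wall-clock time around execute() + fetch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def timed(query: str, params=()):
    """Run a query and return (rows, elapsed milliseconds)."""
    start = time.perf_counter()
    rows = conn.execute(query, params).fetchall()
    conn.commit()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return rows, elapsed_ms

_, insert_ms = timed("INSERT INTO users (name) VALUES (?)", ("alice",))
rows, select_ms = timed("SELECT * FROM users WHERE name = ?", ("alice",))

print(f"INSERT took {insert_ms:.3f} ms, SELECT took {select_ms:.3f} ms")
```

Client-side timing includes the round trip to the database, which is exactly the quantity that matters for the processing-time estimate.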

Estimating processing time#

From the previous two sections, we have identified that the processing time depends on the application server’s computation and the database query handling time. However, network latency between these servers is also a key factor. In this section, we’ll learn how the location of these communicating components affects the overall processing time. We aim to define a range, with minimum and maximum processing times, for serving a simple user request. Using that as a basis, we can determine the plausibility of a practical system.

Let’s take a look at the slides below to see how we estimate the processing time of a simple user request:

1. The processing time required by the API gateway to forward a request to a service

2. The request traveling time to propagate to service A

3. Service A processing the request

4. Network latency to send the request to the database

5. The query execution time at the database

6. The time required for the response to reach back to the API gateway

We take the summation of all the latency and computation times to obtain the following processing time:

Processing time = 2 ms (sum of network latencies) + 2 ms (sum of all processing) = 4 ms

In the slides above, we estimated a computation time of 0.125 ms for a server to handle a request (derived above) and 1.5 ms as the average time to handle a database query (based on the average times estimated in the previous section). We assumed 0.5 ms as the propagation time of each network hop, since the servers are in the same data center. However, the processing time varies if the service components are located at different locations, which eventually affects the response time. The change in processing time for different scenarios is depicted in the following slides:
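Under those same-data-center assumptions, the components sum up as follows. The breakdown of the 2 ms of network latency into four 0.5 ms hops is our own reading of the slides:

```python
# Components of the estimated processing time (same data center).
network_ms = 4 * 0.5          # gateway -> service -> database and back: ~2 ms
compute_ms = 0.125 + 1.5      # request handling + average DB query: ~1.6 ms

total_ms = network_ms + compute_ms   # 3.625 ms, which the lesson rounds to ~4 ms
```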

1. The API gateway processing a request within a data center

2. The API gateway processing a request within a zone

3. The API gateway processing a request within a region (inter-zone)

4. The API gateway processing a request between two regions

From the slides above, we can see that the processing time of a simple user request can vary greatly. In reality, user requests are complex, so much so that the API gateway can make calls to multiple services to compile a response to the request. In that case, depending on the type of query, two types of communication are possible:

  • Parallel communication: In modern applications, the ideal case is that the API gateway communicates simultaneously with all the downstream services. Each service performs its computation in parallel with the other services and provides its results as soon as they are available. This approach saves time and is desired when feasible.

  • Serial communication: The other scenario is when the API gateway communicates serially with all the available services. In this case, the processing time would be the sum of all the processing times taken by individual services. Serial communication is often a requirement when one service depends on another to generate its result.

Parallel vs. serial processing from API gateway to downstream services

Note: In the illustration above, with parallel processing, the API gateway sends the requests to all services in a single step (step 1), whereas with serial processing, it sends the requests in three steps (steps 1, 2, and 3), one after another.
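The difference between the two fan-out styles can be sketched with threads; the service names and latencies below are made up for illustration. Serial time approaches the sum of the per-service latencies, while parallel time approaches the maximum:

```python
import concurrent.futures
import time

def call_service(name: str, latency_s: float) -> str:
    time.sleep(latency_s)          # stand-in for a network call to a service
    return f"{name}: done"

services = [("A", 0.05), ("B", 0.08), ("C", 0.03)]

# Serial: total time is the SUM of individual latencies (~0.16 s here).
start = time.perf_counter()
serial_results = [call_service(n, t) for n, t in services]
serial_s = time.perf_counter() - start

# Parallel: total time approaches the MAX latency (~0.08 s here).
start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor() as pool:
    parallel_results = list(pool.map(lambda s: call_service(*s), services))
parallel_s = time.perf_counter() - start

print(f"serial took {serial_s:.2f}s, parallel took {parallel_s:.2f}s")
```

The same sum-vs-max distinction holds regardless of the concurrency mechanism (threads, async I/O, or separate processes).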

Discussion#

The processing time calculated above is for rudimentary querying or storing data in a database. We performed an estimation that is based on an ideal scenario. In the practical world, several factors can affect the overall processing time of a request. Some of the factors are listed below:

  • The time required for a file storage service will be significantly higher because we’ll be storing the file in different locations, processing it into chunks, extracting its metadata, and storing the corresponding metadata.

  • The processing time may also vary depending on the operations each downstream service needs to perform to process the request. Even in parallel processing, the time service A takes to compute its result will be different from that of service B on most occasions.

  • In real applications, the service is provided from the nearest location (zone or region) to minimize the response time, while data is also replicated to farther locations for backup and disaster recovery. This possible inter-zonal communication is another factor affecting the time.

  • Sometimes, data processing needs intensive computations, like encryption, big data analytics, encoding, and so on, which also affect the processing time.

Other latency-increasing factors may include errors, network or device failures, path or machine resolution operations like hashing, and so on.

Quiz#

Let’s test your knowledge of latency numbers with the following quiz. Match each option in the left column with its corresponding latency number in the right column. This is not a test of memory, just an exercise to see how you solve it.

Match The Answer

Region-region communication takes…

…5 ms

Communication within a region takes…

…100 ms

Communication within the same datacenter takes…

…10 ms

Communication between two data centers within a region takes…

…0.5 ms


In the next lesson, we’ll estimate the latency of an API.
